The present invention relates to the field of speech recognition systems, and more
particularly to the adaptation of a speech recognition system to varying environmental
conditions.

Speech recognition systems transcribe a spoken dictation into written text. The process
of text generation from speech can be divided into the steps of receiving a
sound signal, pre-processing and performing a signal analysis, recognition of analyzed
signals and outputting of recognized text.

The receiving of a sound signal is provided by any means of recording, e.g. a
microphone. In the signal analysis step, the received sound signal is typically segmented
into time windows covering a time interval typically in the range of several
milliseconds. By means of a Fast Fourier Transform (FFT) the power spectrum of each
time window is computed. Further, a smoothing function with typically triangle-shaped
kernels is applied to the power spectrum and generates a feature vector. The single
components of the feature vector represent distinct portions of the power spectrum that
are characteristic for the content of speech and are therefore ideally suited for speech
recognition purposes. Furthermore, a logarithmic function is applied to all components of
the feature vector, resulting in feature vectors of the log-spectral domain. The signal
analysis step may further comprise an environmental adaptation as well as additional
steps, e.g. applying a cepstral transformation or adding derivatives or regression
deltas to the feature vector.
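
The signal analysis chain described above (time windows, power spectrum, triangular smoothing, logarithm) can be illustrated with a minimal sketch. The function names, the band layout and the `floor` parameter are assumptions made for illustration only; a naive discrete Fourier transform stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum of one time window (naive DFT standing in for an FFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def triangular_smooth(spectrum, num_bands):
    """Pool spectral bins into bands with overlapping triangle-shaped kernels.

    The evenly spaced band centers are an assumption of this sketch.
    """
    n = len(spectrum)
    half_width = (n - 1) / (num_bands + 1)
    bands = []
    for b in range(num_bands):
        center = (b + 1) * half_width
        total = 0.0
        for k, power in enumerate(spectrum):
            weight = max(0.0, 1.0 - abs(k - center) / half_width)
            total += weight * power
        bands.append(total)
    return bands


def log_spectral_features(signal, frame_len, hop, num_bands, floor=1e-10):
    """Return one log-spectral feature vector per time window."""
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        bands = triangular_smooth(power_spectrum(frame), num_bands)
        # The floor avoids log(0) for empty bands; it is not part of the source text.
        features.append([math.log(max(band, floor)) for band in bands])
    return features
```

Each returned vector corresponds to one time window, and each of its components to one smoothed portion of the power spectrum in the log-spectral domain.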



In the recognition step, the analyzed signals are compared with reference signals derived
from training speech sequences being assigned to a vocabulary. Furthermore, grammar
rules as well as context dependent commands can be performed before the recognized
text is outputted in a last step.

Environmental adaptation is an important step within a signal analysis procedure.
Essential sources of environmental mismatch between trained speech reference and
recognition data are for example different signal to noise ratios, different recording
channel noise or different speech-and-silence proportions.

US Pat. No. 5,778,340 discloses a speech recognition system having an adaptation
function. Here speech input is converted into a feature vector series which is fed to a
preliminary recognizer. The preliminary recognizer executes preliminary recognition by
calculating similarity measures between the input pattern and a reference pattern stored
in a reference pattern memory. In this way top candidates are determined by means of
the calculated similarity measures. A reference pattern adaptor executes adaptation of
the reference patterns based on the reference patterns, the input pattern and the top
candidates, and newly stores the adapted reference pattern in the reference pattern
memory. A final recognizer then executes the speech recognition of the input pattern by
using the newly stored reference patterns corresponding to the top candidates.

The adaptation means comprise the separation of an input pattern into speech periods and
noise periods. Noise periods correspond to sound intervals of a speech discontinuity. US
Pat. No. 5,778,340 further discloses a calculation of mean spectra for noise and speech
periods of the reference and input patterns. The adaptation of either input or reference
pattern is then performed by means of some sort of adaptation function making use of
the calculated spectra. However, this method is based on a hard decision whether a sound
interval represents speech or noise. Depending on the received sound signal and the
additional noise such a decision cannot be made unambiguously. In some critical
situations the underlying system may therefore interpret a noise period as a speech period
and vice versa.

US Pat. No. 2002/0091521A1 describes a technique for rapid speech recognition under
mismatched training and testing conditions. The illustrated technique is based on a
maximum likelihood spectral transformation (MLST). Here speech feature vectors of
real time utterances are transformed in a linear spectral domain such that a likelihood of
the utterances is increased after the transformation. The maximum likelihood spectral
transformation estimates two parameters corresponding to convolutional noise and
additive noise in the linear spectral domain. After the two noise parameters have been
estimated, a transformation of the feature vectors is performed in order to increase the
likelihood of testing utterances. Since the transformation is applied in the linear
spectral domain and since the dynamic range of speech is fairly large, a
robust determination of the necessary parameters might be difficult.

US Pat. No. 2003-0050780A1 describes a speaker adaptation upon input speech that is
provided in the presence of background noise. Here a linear approximation of the
background noise is applied after the feature extraction and prior to speaker adaptation
to allow the system to adapt the speech model to the enrolling user without distortion
from background noise. Here a speaker adaptation module employs an inverse linear
approximation operator to remove the effect of the background noise prior to
adaptation. The result of the inverse approximation is a set of modified observation data
that has been cleaned up to remove the effect of background noise. A noise
compensated recognizer described in US Pat. No. 2003-0050780A1 uses acoustic
models that are developed under certain noise conditions and that are then used under
different noise conditions. Therefore an estimate of the noise level difference between
the at least two noise conditions must be assessed. This is typically performed by
a feature extraction module which extracts features from a pre-speech frame before the
input speech utterance begins.

The present invention aims to provide an improved method and apparatus for the
adaptation of a speech recognition system to various environmental conditions.

The invention provides a method of environmental adaptation of a speech recognition
system by making use of a generation of a sequence of feature vectors in the log-spectral
domain, the calculation of probabilities whether a received sound interval
represents speech or a speech discontinuity, and the calculation of mean values for speech
and mean values for silence intervals for speech to be recognized and training speech,
respectively.

Each feature vector of the sequence of feature vectors in the log-spectral domain is
descriptive of a power spectrum of the speech to be recognized that corresponds to a
time window covering a distinct time interval. The speech recognition system typically
comprises a set of reference feature vectors that were recorded under training conditions
for recognition purposes. The method of the invention is principally based on a
transformation of feature vectors such that a mismatch due to different environmental
recording conditions is minimized.

According to a preferred embodiment of the invention the method does not strictly
separate whether a sound interval represents speech or a speech discontinuity in the
form of silence. Instead the method determines and calculates a probability that a sound
interval represents speech or silence. In this way, a hard, potentially wrong, decision is
avoided, increasing the overall reliability of the entire speech recognition system.

For each component of the feature vector the method calculates a silence probability by
means of a monotonously decreasing probability function. The parameter needed by the
probability function is simply the modulus of the respective feature vector component.
The larger the feature vector component, the smaller the probability that the respective
feature vector component represents a silence interval. The corresponding speech
probability is given by unity minus the silence probability.

The method further calculates a mean value for silence and speech intervals for each
feature vector component by means of a mean function. On the basis of a subset of
feature vectors, the mean function provides an average value for the respective feature
vector component based on the silence and speech probabilities as weights.
Correspondingly, the method further calculates mean values for silence and speech of
the single components of the training feature vectors. The essential transformation
function for the environmental adaptation is then performed for each component of the
feature vectors separately on the basis of the feature vector component itself, the silence
probability of the feature vector component, the mean value for silence and the mean
value for speech of the respective feature vector components of a subset of feature
vectors, and a mean value for silence and a mean value for speech of the respective
feature vector components of a subset of training feature vectors.



Comparison between mean values for silence of a subset of feature vectors and a subset
of training feature vectors gives a general indication about the noise level and/or
different environmental recording conditions of the recorded speech. Similarly the
mean values for speech of a subset of feature vectors and the subset of training feature
vectors can be compared. Typically the transformation of feature vector components
makes use of this comparison in combination with the probability values of the feature
vector component.

According to a further preferred embodiment of the invention, a calculation of a speech
probability of each feature vector component is performed. Typically the method makes
use of the monotonously decreasing probability function for generating the silence
probability and subsequently subtracting the silence probability from the number 1.
According to this embodiment the transformation of the feature vector components
takes explicitly into account the calculated speech probability.



According to a further preferred embodiment of the invention, generating mean
values for silence and speech for the feature vector components as well
as for the training feature vector components is realized in the form of a moving weighted
average function. Averaging is performed over a subset of feature vectors. For example,
the mean value for silence of a distinct feature vector component is given by the sum
over the products of the respective feature vector components and their silence
probabilities, divided by the sum of all respective silence probabilities, wherein the
summation index runs over all feature vectors of the subset of feature vectors.
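
The weighted average described above can be sketched as follows for one feature vector component over a subset of feature vectors. The function name and the `default` fallback for an all-zero weight sum are assumptions of this sketch and are not taken from the source.

```python
def weighted_means(components, silence_probs, default=0.0):
    """Silence and speech means of one component over a subset of vectors.

    components:    values of the same component i taken from each vector j
    silence_probs: silence probability of each of those values
    """
    speech_probs = [1.0 - p for p in silence_probs]

    def wmean(weights):
        total = sum(weights)
        if total == 0.0:
            return default  # no evidence for this class in the subset
        return sum(f * w for f, w in zip(components, weights)) / total

    return wmean(silence_probs), wmean(speech_probs)
```

With hard probabilities of 0 and 1 the function reduces to ordinary per-class averages; with soft probabilities every value contributes to both means in proportion to its weight.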

The calculation of silence or speech mean values of feature vector components is
performed for the subset of feature vectors in the same way as for the subset of training
feature vectors. Both subsets typically comprise the same number of feature vectors.
The mean values of the feature vectors being permanently acquired during the speech
recognition dynamically change and have to be recalculated during the process of
speech recognition, whereas the mean values representing the training feature vectors
remain constant and can therefore be stored by some kind of storing means. In this way
the method dynamically adapts to varying environmental conditions. This provides a
high reliability and a high flexibility of the speech recognition system.
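
One possible way to keep the running means up to date while the training means stay fixed is a bounded window, sketched below. The class name, the window length of 20 and the use of `None` for an undefined mean are assumptions of this illustration.

```python
from collections import deque


class MovingWeightedMeans:
    """Tracks silence and speech means of one component over the last N vectors."""

    def __init__(self, window=20):
        # Each entry is (component value, silence probability).
        self._window = deque(maxlen=window)

    def update(self, component, p_sil):
        """Add the newest value and return the recalculated (silence, speech) means."""
        self._window.append((component, p_sil))
        return self.means()

    def means(self):
        sil_den = sum(p for _, p in self._window)
        sp_den = sum(1.0 - p for _, p in self._window)
        sil_num = sum(f * p for f, p in self._window)
        sp_num = sum(f * (1.0 - p) for f, p in self._window)
        m_sil = sil_num / sil_den if sil_den > 0.0 else None
        m_speech = sp_num / sp_den if sp_den > 0.0 else None
        return m_sil, m_speech
```

The training means, by contrast, would be computed once and simply read from storage during recognition.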

According to a preferred embodiment of the invention, the subset of feature vectors for
the calculation of mean values for silence and speech of feature vector components
comprises at least 10, preferably between 20 and 30, feature
vectors.

According to a further preferred embodiment of the invention, the monotonously
decreasing probability function comprises a slope constant (α) which is descriptive of
the slope of the monotonously decreasing probability function. In this way the
assignment of a speech probability or a silence probability to a distinct feature vector
component can be manually adapted by variation of the slope constant (α). This is of
great practical use since the speech recognition system can be manually adapted to
different types of environmental noise, such as white noise or other types of more
irregular noise patterns.

According to a further preferred embodiment of the invention, the silence probability
function evaluated at the mean value for silence plus the appropriate variance value for
silence results in a silence probability of 0.5.

According to a further preferred embodiment of the invention, the silence probability
function is given by a Sigmoid function whose specific form is further specified by:

$$P_{sil} = \frac{1}{1 + \exp\left((M_{sil} + V_{sil} - F_c)\,\alpha / V_{sil}\right)}$$

where:

$M_{sil}$ : mean value for silence of feature vectors,
$V_{sil}$ : variance value for silence of feature vectors,
$\alpha$ : slope constant,
$F_c$ : feature vector component.
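
A minimal sketch of such a Sigmoid silence probability is given below. It follows the exponent as far as it is legible in the source, $(M_{sil} + V_{sil} - F_c)\,\alpha / V_{sil}$; the division by the variance and the negative default for `alpha` are assumptions of this sketch. With this form, a negative slope constant yields the monotonously decreasing behaviour the text describes. The clamping of the exponent is added only to avoid floating point overflow.

```python
import math


def silence_probability(f_c, mean_sil, var_sil, alpha=-1.0):
    """Sigmoid silence probability of one feature vector component.

    alpha < 0 makes the function decrease with growing f_c (assumed sign).
    """
    z = (mean_sil + var_sil - f_c) * alpha / var_sil
    z = max(-700.0, min(700.0, z))  # keep math.exp within float range
    return 1.0 / (1.0 + math.exp(z))


def speech_probability(f_c, mean_sil, var_sil, alpha=-1.0):
    """Speech probability as unity minus the silence probability."""
    return 1.0 - silence_probability(f_c, mean_sil, var_sil, alpha)
```

At `f_c == mean_sil + var_sil` the exponent vanishes and the silence probability is exactly 0.5, which matches the property stated for the mean value for silence plus the variance value.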






According to a further preferred embodiment of the invention, the transformation
function for the feature vector components is given by the following mathematical
model:

$$F_{new} = F_{old} + (MTR_{sil} - M_{sil})\,P_{sil} + (MTR_{speech} - M_{speech})\,P_{speech}$$

where:

$F_{new}$ : transformed feature vector component,
$F_{old}$ : feature vector component,
$MTR_{sil}$ : mean value for silence of training feature vectors,
$MTR_{speech}$ : mean value for speech of training feature vectors,
$M_{speech}$ : mean value for speech of feature vectors,
$M_{sil}$ : mean value for silence of feature vectors,
$P_{sil}$ : silence probability,
$P_{speech}$ : speech probability.
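
The transformation of a single feature vector component can be sketched directly from the model above. The function and argument names are chosen for illustration; the speech probability is taken as one minus the silence probability, as described earlier in the text.

```python
def transform_component(f_old, p_sil, m_sil, m_speech, mtr_sil, mtr_speech):
    """Shift one component towards the training conditions.

    The silence offset and the speech offset are blended by the soft
    silence/speech probabilities instead of a hard speech/noise decision.
    """
    p_speech = 1.0 - p_sil
    return f_old + (mtr_sil - m_sil) * p_sil + (mtr_speech - m_speech) * p_speech
```

For a component that is certainly silence (`p_sil == 1.0`) only the silence offset is applied, for certain speech only the speech offset, and for intermediate probabilities a weighted mixture of both.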



Furthermore the method for environmental adaptation is not only applicable to feature
vectors but can also be applied to entire spectra in the log-spectral domain.
Furthermore the essential sources of environmental mismatch between trained speech
references and recognition data, like signal-to-noise ratio, recording channel and varying
speech-and-silence proportion in the utterances, are handled simultaneously. Since the
procedure and the method provide a simple computation algorithm, it is especially suited
for the utilization in digital signal processors (DSP) with low resources of memory and
computation time.

In the following, preferred embodiments of the invention will be described in greater
detail by making reference to the drawings in which:

Figure 1 shows a flow chart diagram of a speech recognition system.

Figure 2 is illustrative of a flow chart for performing an environmental adaptation.

Figure 3 shows a monotonously decreasing probability function.

Figure 4 shows a block diagram of a speech recognition system and an
environmental adaptation according to the invention.

Figure 1 schematically shows a flow chart diagram of a speech recognition system. In a
first step 100 speech is inputted into the system by means of some sort of recording
device, such as a conventional microphone. In the next step 102, the recorded signals
are analyzed by performing the following steps: segmenting the recorded signals into
framed time windows, performing a power density computation, generating feature
vectors in the log-spectral domain, performing an environmental adaptation and
optionally performing additional steps.

In the first step of the signal analysis 102, the recorded speech signals are segmented
into time windows covering a distinct time interval. Then the power spectrum for each
time window is calculated by means of a Fast Fourier Transform (FFT). Based on the
power spectrum, the feature vectors are generated, being descriptive of the most relevant
frequency portions of the spectrum that are characteristic for the speech content. In the
next step of the signal analysis 102 an environmental adaptation according to the present
invention is performed in order to reduce a mismatch between the recorded signals and the
reference signals extracted from training speech being stored in the system.



Furthermore additional steps may be optionally performed, such as a cepstral
transformation. In the next step 104, the speech recognition is performed based on the
comparison between the feature vectors based on training data and the feature vectors
based on the actual signal analysis plus the environmental adaptation. The training data
in form of trained speech references are provided as input to the speech recognition step
104 by the step 106. The recognized text is then outputted in step 108. Outputting of
recognized text can be performed in a variety of different ways, such as
displaying the text on some sort of graphical user interface, storing the text on some sort
of storage medium or simply printing the text by means of some printing device.

Figure 2 is illustrative of the environmental adaptation according to the present
invention. The feature vectors provided by the speech recognition system are adapted to
the specific environmental conditions. Here the single components i of each feature
vector j are transformed in order to minimize the mismatch between feature vector
components generated from received speech and feature vector components of training
data.

In the first step 200, a feature vector (j = 1) is selected. In the next step 202 a single
component (i = 1) of feature vector j is selected. The selected feature vector component
is then passed to step 204 in which a silence probability of the feature vector component
is calculated according to the probability function. In step 206, the appropriate speech
probability of the feature vector component is calculated. The calculated silence and
speech probabilities of the vector component are indicative of whether the selected feature
vector component represents speech or a speech discontinuity. In step 208 a mean value
for silence of the feature vector component i of all feature vectors j is calculated. In step
210 the appropriate mean value for speech of the feature vector component i of all
feature vectors j is calculated.

The calculation of the mean values for silence and the mean values for speech of a
distinct component i of all feature vectors j is based on a moving weighted average
function. In steps 224 and 226, appropriate mean values for silence and mean values for
speech for a distinct feature vector component i of the training feature vectors for all
feature vectors j of training data are calculated and provided to step 212. Based on the
selected feature vector component, the calculated silence probability of the feature
vector component of step 204 and the calculated speech probability of the feature vector
component of step 206 as well as the silence mean value of step 208, the speech mean
value of step 210 and the silence and speech mean values of the training data of step
224 and step 226, the selected feature vector component is transformed into a new
feature vector component in step 212.

The generated mean values for speech and for silence give an indication of
environmental mismatch when compared to the appropriate mean values for silence and
speech of the training data that were recorded under e.g. ideal, hence noiseless,
environmental conditions. When the transformation of the feature vector components
has been performed in step 212, the newly created feature vector components, hence the
environmentally adapted feature vector components, are submitted in step 214 to the
speech recognition module. After the adapted feature vector components have been
submitted in step 214, the method checks in step 216 whether the index i of the component
of a feature vector is larger than or equal to the number m of components of a feature
vector. If in step 216 the component index i is smaller than m, the number of components
of a feature vector, then the component index i is incremented by 1 and the method
returns to step 204. When in the other case the component index i is larger than or equal
to the number m of components of a feature vector, the method proceeds with step 218 in
which the entire feature vector is subject to speech recognition performed by the speech
recognition module. After the speech recognition of step 218, the step 220 checks
whether the feature vector index j has reached the number n of feature vectors. If not,
the method continues with the next feature vector. In the other case, when j is larger
than or equal to n, all feature vectors have been transformed and the method stops in
step 222.
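
The loop over feature vectors j and components i of Figure 2 can be condensed into the following sketch. It is an illustration under stated assumptions rather than the method of the source: the per-component silence statistics `sil_stats` used by the probability function are taken as given inputs, the negative default of `alpha` is an assumed sign convention, and the hand-over to the recognition module (steps 214 and 218) is replaced by returning the adapted vectors.

```python
import math


def silence_probability(f_c, mean_sil, var_sil, alpha=-1.0):
    z = (mean_sil + var_sil - f_c) * alpha / var_sil
    z = max(-700.0, min(700.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(z))


def weighted_mean(values, weights, default=0.0):
    total = sum(weights)
    if total == 0.0:
        return default
    return sum(v * w for v, w in zip(values, weights)) / total


def component_means(features, sil_stats, alpha=-1.0):
    """Steps 204 to 210: probabilities and weighted means per component i."""
    probs, means = [], []
    for i in range(len(features[0])):
        column = [vec[i] for vec in features]
        mean_sil, var_sil = sil_stats[i]
        p_sil = [silence_probability(f, mean_sil, var_sil, alpha) for f in column]
        p_speech = [1.0 - p for p in p_sil]
        probs.append(p_sil)
        means.append((weighted_mean(column, p_sil), weighted_mean(column, p_speech)))
    return probs, means


def adapt(features, train_means, sil_stats, alpha=-1.0):
    """Step 212 for every component i of every feature vector j."""
    probs, means = component_means(features, sil_stats, alpha)
    adapted = []
    for j, vec in enumerate(features):
        new_vec = []
        for i, f_old in enumerate(vec):
            p_sil = probs[i][j]
            m_sil, m_speech = means[i]
            mtr_sil, mtr_speech = train_means[i]
            new_vec.append(f_old
                           + (mtr_sil - m_sil) * p_sil
                           + (mtr_speech - m_speech) * (1.0 - p_sil))
        adapted.append(new_vec)
    return adapted
```

If the training means coincide with the means of the incoming data, the vectors pass through unchanged; a uniform offset between both sets of means shifts every component by that offset, since the silence and speech probabilities sum to one.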



In order to reduce computation time and to increase the efficiency of the environmental
adaptation method, the calculation of the mean values for silence and speech of steps 208
and 210 does not necessarily have to include all feature vectors. Instead the calculation
of silence and speech mean values can also be based on a subset of feature vectors. In
this case the mean values for silence and speech of the training feature vectors of
the steps 224 and 226 also have to be based on the appropriate subset of training feature
vectors. In this way not the entirety of feature vectors and training feature vectors has
to be taken into account for the calculation of mean values for silence and speech
necessary to perform the environmental adaptation of the feature vectors.



Figure 3 illustrates a typical probability function for the calculation of the silence
probability of a feature vector component. The abscissa 300 represents the modulus of the
feature vector component, whereas the ordinate 302 gives the appropriate silence
probability by means of the function illustrated by the graph 304. The probability
function according to the invention can in principle be represented by any monotonously
decreasing function. The function 304 is only an example of a Sigmoid function which
is commonly used for probability distributions in speech recognition systems. Here
the probability function gives a silence probability around 0.5 for the sum of the mean
value for silence plus the appropriate variance value.

Figure 4 shows a block diagram of a speech recognition system 402 with an
environmental adaptation according to the present invention. Generally, speech 400 is
inputted into the speech recognition system 402 which performs a speech to text
transformation, with the text 404 being outputted from the speech recognition system
402. The speech recognition system 402 comprises a feature vector generation module
406, an environmental adaptation module 408 and a speech recognition module 410.
Furthermore the speech recognition system comprises training feature vectors 412 as
well as memory modules 414 and 416 for storing and providing silence and speech
probabilities as well as silence and speech mean values of the training feature vectors.

The environmental adaptation module 408 comprises a silence and speech probability
module 418, a silence and speech mean value module 420 as well as a feature vector
transformation module 422.

Recorded speech 400 is transmitted to the feature vector generation module 406. The
feature vector generation module 406 performs the necessary steps in order to generate
feature vectors in the log-spectral domain for speech recognition purposes. The generated
feature vectors are then transmitted to the silence and speech probability module 418
and to the silence and speech mean value module 420 as well as to the feature vector
transformation module 422 of the environmental adaptation module 408. The silence
and speech probability module 418 calculates a speech and silence probability for each
feature vector component in the same way as the silence and speech mean value module
420 calculates mean values for speech and silence for each feature vector component.

The silence and speech probabilities generated in this way, as well as the silence and
speech mean values for each feature vector component, are transmitted to the feature vector
transformation module 422. Based on the transformation function, the specific feature
vector component, the silence and speech probability as well as the mean values for
silence and speech and the silence and speech mean values of the training feature
vectors 412, the feature vector transformation module 422 performs a transformation of
the specific feature vector component.

Since the transformation is performed for each component of all feature vectors, the
entirety of feature vectors generated by the feature vector generation module 406 is
environmentally adapted by creating a new set of feature vector components that are
submitted to the speech recognition module 410. In the speech recognition module 410
the environmentally adapted feature vectors of the speech 400 are compared with
training feature vectors 412 in order to assign portions of speech to text and text
phrases. The recognized speech is then finally outputted as text 404.

LIST OF REFERENCE NUMERALS

400 Speech
402 Speech recognition system
404 Text
406 Feature vector generation module
408 Environmental adaptation module
410 Speech recognition module
412 Training feature vectors
414 Memory for probability of training feature vectors
416 Memory for mean values of training feature vectors
418 Probability module
420 Mean value module
422 Feature vector transformation module

CLAIMS

1. A method of environmental adaptation of a speech recognition system (402) providing a
sequence of feature vectors, each feature vector being descriptive of a power spectrum of
speech (400) to be recognized, for each feature vector component, the method comprising
the steps of:

calculating a silence probability of the feature vector component by means
of a monotonously decreasing probability function,

providing mean values for silence and speech intervals of respective
components of at least a subset of training feature vectors,

calculating a mean value for silence and speech intervals for the feature
vector component by means of a mean function based on at least a subset of
respective feature vectors,

transforming the feature vector component by means of a transformation
function, the transformation function being based on the mean values for
silence and speech of the feature vectors and the training feature vectors, the
silence probability of the feature vector component and on the feature
vector component itself.

2. The method according to claim 1, the method further for each feature vector component
comprising the steps of:

calculating a speech probability for speech by means of a monotonously
increasing probability function,

transforming the feature vector component by means of the transformation
function, the transformation function being further based on the probability
for speech of the feature vector component.

3. The method according to claim 1 or 2, wherein the mean function is a moving weighted
average function, the calculation of the mean value for silence and speech intervals being
based on the subset of feature vectors, the subset comprising at least a number of 10,
preferably a number of 20 to 30 feature vectors.

4. The method according to any one of the claims 1 to 3, wherein providing of mean values
for silence and speech intervals of the training feature vectors is based on a training mean
function, which is a weighted average function for a subset of training feature vectors, the
subset comprising at least a number of 10, preferably a number of 20 to 30 feature vectors.

5. The method according to any one of the claims 1 to 4, wherein the probability function
comprises a slope constant (α) being descriptive of the slope of the monotonous probability
function, the slope constant being modifiable.

6. The method according to any one of the claims 1 to 5, wherein the transformation of the
feature vector component is given by:

$$F_{new} = F_{old} + (MTR_{sil} - M_{sil})\,P_{sil} + (MTR_{speech} - M_{speech})\,P_{speech}$$

where:

$F_{new}$ : transformed feature vector component,
$F_{old}$ : feature vector component,
$MTR_{sil}$ : mean value for silence of training feature vectors,
$MTR_{speech}$ : mean value for speech of training feature vectors,
$M_{speech}$ : mean value for speech of feature vectors,
$M_{sil}$ : mean value for silence of feature vectors,
$P_{sil}$ : silence probability,
$P_{speech}$ : speech probability.

7. The method according to any one of the claims 1 to 6, wherein the silence probability
function is given by a Sigmoid function of the form:

$$P_{sil} = \frac{1}{1 + \exp\left((M_{sil} + V_{sil} - F_c)\,\alpha / V_{sil}\right)}$$

and the speech probability function being given by:

$$P_{speech} = 1 - P_{sil}$$

where:

$M_{sil}$ : mean value for silence interval of the speech,
$V_{sil}$ : variance from the mean value for silence,
$\alpha$ : slope constant,
$F_c$ : feature vector component.

8. A system for speech recognition (402) with environmental adaptation, providing a
sequence of feature vectors, each feature vector being descriptive of a power spectrum of
speech (400) to be recognized, for each feature vector component, the system comprising:

means for calculating a silence probability (418) of the feature vector
component by means of a monotonously decreasing probability function,

means for providing mean values (416) for silence and speech intervals of
respective components of at least a subset of training feature vectors,

means for calculating a mean value for silence and speech intervals (420)
for the feature vector component by means of a mean function based on at
least a subset of respective feature vectors,

means for transforming the feature vector component (422) by means of a
transformation function, the transformation function being based on the
mean values for silence and speech of the feature vectors and the training
feature vectors, the silence probability of the feature vector component and
on the feature vector component itself.

9. The system according to claim 8, the system for each feature vector component
comprising:

means for calculating a speech probability for speech (418) by means of a
monotonously increasing probability function.

10. The system according to claim 8 or 9, wherein the mean function is a moving weighted
average function, the calculation of the mean value for silence and speech intervals being
based on the subset of feature vectors, the subset comprising at least a number of 10,
preferably a number of 20 to 30 feature vectors.

11. The speech recognition system according to any one of the claims 8 to 10, wherein
means to provide mean values for silence and speech of training feature vector components
(416) comprise storage means in which the mean values for silence and speech of training
feature vector components are stored.

12. A computer program product with computer program means for a system for speech
recognition (402) with environmental adaptation providing a sequence of feature vectors,
each feature vector being descriptive of a power spectrum of speech to be recognized, for
each feature vector component the computer program product comprising program means
for:

providing mean values for silence and speech intervals of respective
components of at least a subset of training feature vectors,

transforming the feature vector component by means of a transformation
function, the transformation function being based on the mean values for
silence and speech of the feature vectors and the training feature vectors, the
silence probability of the feature vector component and on the feature
vector component itself.



13. The computer program product according to claim 12, for each feature vector
component the computer program product comprising program means for:

calculating a speech probability for speech by means of a monotonically
increasing probability function,

transforming the feature vector component by means of the transformation
function, the transformation function being further based on the probability
for speech of the feature vector component.

14. The computer program product according to claim 12 or 13, wherein the mean function
is a moving weighted average function, the calculation of the mean value for silence and
speech intervals being based on the subset of feature vectors, the subset comprising at least
a number of 10, preferably a number of 20 to 30 feature vectors.

15. The computer program product according to any one of the claims 12 to 14, wherein the
transformation of the feature vector component is given by:

\[ F' = F + (MTR_{sil} - M_{sil})\,P_{sil} + (MTR_{sp} - M_{sp})\,P_{sp}, \]

where

$F'$: transformed feature vector component,
$F$: feature vector component,
$MTR_{sil}$: mean value for silence of training feature vectors,
$MTR_{sp}$: mean value for speech of training feature vectors,
$M_{sp}$: mean value for speech of feature vectors,
$M_{sil}$: mean value for silence of feature vectors,
$P_{sil}$: silence probability,
$P_{sp}$: speech probability.
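
The transformation in this claim can be sketched in Python as follows, assuming it is the additive shift F' = F + (MTR_sil - M_sil) P_sil + (MTR_sp - M_sp) P_sp suggested by the listed symbols. The function name `transform_component`, the argument names, and the fallback P_sp = 1 - P_sil when no speech probability is passed are assumptions made for the example.

```python
def transform_component(f, m_sil, m_sp, mtr_sil, mtr_sp, p_sil, p_sp=None):
    """Shift one feature vector component towards the training conditions.

    f        : feature vector component F
    m_sil    : mean value for silence of the incoming feature vectors
    m_sp     : mean value for speech of the incoming feature vectors
    mtr_sil  : mean value for silence of the training feature vectors
    mtr_sp   : mean value for speech of the training feature vectors
    p_sil    : silence probability of this component
    p_sp     : speech probability; assumed to be 1 - p_sil if omitted
    """
    if p_sp is None:
        p_sp = 1.0 - p_sil
    # Each mismatch term is weighted by how likely the frame is silence or speech.
    return f + (mtr_sil - m_sil) * p_sil + (mtr_sp - m_sp) * p_sp
```

For a component that is certainly silence (`p_sil = 1`), only the silence mismatch is applied; for certain speech, only the speech mismatch; in between, the two shifts are blended.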



16. The computer program product according to any one of the claims 12 to 15, wherein the
silence probability function is given by a sigmoid function of the form:



1 + e:q>((M^, + F^/ - / ) ' 



and the speech probability function being given by:

\[ P_{sp} = 1 - P_{sil} \]



where:

$M_{sil}$: mean value for silence interval of the speech,
$V_{sil}$: variance from the mean value for silence,
$\alpha$: slope constant,
$F$: feature vector component.
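
A sigmoid silence probability of the kind named in this claim might look like the Python sketch below. The exact formula is not fully legible in this copy, so the parametrization used here, P_sil = 1 / (1 + exp(alpha * (F - (M_sil + V_sil)))) with a positive slope constant `alpha`, is an assumption chosen only so that the silence probability falls and the speech probability rises monotonically with the component value. The function names are likewise hypothetical.

```python
import math


def silence_probability(f, m_sil, v_sil, alpha=1.0):
    """Assumed sigmoid: 0.5 at f = m_sil + v_sil, decreasing as f grows."""
    x = alpha * (f - (m_sil + v_sil))
    # Clamp the exponent so math.exp cannot overflow for extreme inputs.
    x = max(-700.0, min(700.0, x))
    return 1.0 / (1.0 + math.exp(x))


def speech_probability(f, m_sil, v_sil, alpha=1.0):
    """Speech probability as the complement, P_sp = 1 - P_sil."""
    return 1.0 - silence_probability(f, m_sil, v_sil, alpha)
```

With this assumed form, a larger `alpha` makes the transition between silence and speech sharper around the threshold `m_sil + v_sil`.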
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ABSTRACT 



Adaptation of environment mismatch for speech recognition systems

The present invention relates to a method, a system and a computer program product for
speech recognition with environmental adaptation. Feature vectors being descriptive of a
power spectrum of incoming speech are transformed in order to eliminate environmental
mismatch between the recording conditions of training speech and the recording conditions
of the speech being subject to speech recognition. The method is based on a probability
of whether a received sound interval represents speech or a speech discontinuity. By determining
mean values for sound intervals representing speech or speech discontinuity and comparing
said values with respective values of the training data, a transformation of generated feature
vectors can be performed in order to reduce the environmental mismatch.
Trained speech references

Speech input

100

Signal Analysis:

- framing and Hamming window
- power density computation (FFT)
- smoothing, generation of feature vector
- environmental adaptation
- additional steps (e.g. cepstral transformation)

Output of recognized text

Fig. 1
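
The signal analysis steps listed in Fig. 1 (framing, Hamming window, power spectrum, smoothing into a feature vector, logarithm) can be sketched in plain Python as below. This is only an illustration under stated assumptions: a naive discrete Fourier transform stands in for the FFT to keep the example dependency-free, the triangular kernels are spaced evenly on a linear frequency axis, and the frame size, step, band count, and all function names are hypothetical choices, not values from the document.

```python
import cmath
import math


def frames(signal, size, step):
    """Split the signal into overlapping time windows of `size` samples."""
    return [signal[s:s + size] for s in range(0, len(signal) - size + 1, step)]


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def power_spectrum(frame):
    """Power at bins 0..n/2 via a naive DFT (stand-in for an FFT)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2)
    return spec


def triangular_smooth(power, n_bands):
    """Sum the power under overlapping triangular kernels (assumed layout)."""
    half = (len(power) - 1) / (n_bands + 1)  # half-width of each triangle
    out = []
    for b in range(n_bands):
        center = (b + 1) * half
        out.append(sum(max(0.0, 1.0 - abs(k - center) / half) * p
                       for k, p in enumerate(power)))
    return out


def log_feature_vectors(signal, size=64, step=32, n_bands=8, floor=1e-10):
    """Return one log-spectral feature vector per frame."""
    win = hamming(size)
    feats = []
    for fr in frames(signal, size, step):
        windowed = [x * w for x, w in zip(fr, win)]
        bands = triangular_smooth(power_spectrum(windowed), n_bands)
        # The floor avoids log(0) for empty bands.
        feats.append([math.log(max(b, floor)) for b in bands])
    return feats
```

Each returned vector has `n_bands` components; the environmental adaptation would then operate on these log-spectral components before any further steps such as a cepstral transformation.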
select feature vector j = 1
select vector component i = 1 of feature vector j, F(i,j)
calculate Psil(i,j)
calculate Psp(i,j)
provide MTRsil(i) for all j of training feature vectors
provide MTRsp(i) for all j of training feature vectors
calculate Msil(i) for all j of subset of feature vectors
calculate Msp(i) for all j of subset of feature vectors
transform F(i,j) to F'(i,j)
submit F'(i,j) to speech recognition module
End